Unique Entity Estimation with Application to the Syrian Conflict

Authors

  • Beidi Chen
  • Anshumali Shrivastava
  • Rebecca C. Steorts
Abstract

Entity resolution identifies and removes duplicate entities in large, noisy databases and has grown in both usage and new developments as a result of increased data availability. Nevertheless, entity resolution involves tradeoffs among assumptions about the data generation process, error rates, and computational scalability that make it a difficult task for real applications. In this paper, we focus on the related problem of unique entity estimation, which is the task of estimating the number of unique entities, with associated standard errors, in a data set that contains duplicate entities. Unique entity estimation shares many fundamental challenges of entity resolution, namely, that the computational cost of all-to-all entity comparisons is intractable for large databases. To circumvent this computational barrier, we propose an efficient (near-linear time) estimation algorithm based on locality-sensitive hashing. Our estimator, under realistic assumptions, is unbiased and has provably low variance compared to existing random-sampling-based approaches. In addition, we empirically show its superiority over state-of-the-art estimators on three real applications. The motivation for our work is to derive an accurate estimate of the documented, identifiable deaths in the ongoing Syrian conflict. Our methodology, when applied to the Syrian data set, provides an estimate of 191,874 ± 1,772 documented, identifiable deaths, which is very close to the Human Rights Data Analysis Group (HRDAG) estimate of 191,369. Our work provides an example of the challenges and efforts involved in solving a real, noisy, challenging problem where modeling assumptions may not hold.
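
The abstract does not spell out the estimator, so the following is only a minimal sketch of the general idea it alludes to: locality-sensitive hashing can propose candidate duplicate pairs without all-to-all comparison, and the connected components of the matched pairs give a count of distinct entities. This is not the paper's estimator (it computes a plain count with no sampling correction and no standard error). The function names, the character 3-gram representation, the band and row settings, and the 0.5 Jaccard threshold are all assumptions chosen for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Return the set of character k-grams of a lowercased record string."""
    s = text.lower()
    return {s[i:i + k] for i in range(max(1, len(s) - k + 1))}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed."""
    digest = hashlib.md5(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(tokens, num_hashes):
    """MinHash signature: the minimum hash value under each seeded hash."""
    return tuple(min(_hash(seed, t) for t in tokens) for seed in range(num_hashes))


def jaccard(a, b):
    """Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def count_entities(records, bands=8, rows=2, threshold=0.5):
    """Count connected components of records linked via LSH candidate pairs.

    Hypothetical helper: only records that share an LSH bucket are compared,
    and a candidate pair is linked when its Jaccard similarity >= threshold.
    """
    sets = [shingles(r) for r in records]
    sigs = [minhash_signature(s, bands * rows) for s in sets]

    # Banding: records that agree on every row of some band share a bucket.
    buckets = defaultdict(list)
    for idx, sig in enumerate(sigs):
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].append(idx)

    # Union-find over record indices.
    parent = list(range(len(records)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for members in buckets.values():
        for i, j in combinations(members, 2):
            if jaccard(sets[i], sets[j]) >= threshold:
                parent[find(i)] = find(j)

    return len({find(i) for i in range(len(records))})
```

Calling `count_entities` on a list of record strings returns the number of linked groups; exact duplicates always fall in the same bucket, while near-duplicates are found only with high probability, which is one reason a principled estimator with standard errors, as the abstract describes, is needed in practice.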

Similar resources

Human organ absorbed dose estimation of 166Ho-BPAMD complex based on biodistribution data of male Syrian rats

Introduction: Recently, 166Ho-BPAMD was introduced as a suitable agent for bone marrow ablation. The aim of this study was to estimate the absorbed dose of this novel agent in the human organs which is necessary before the clinical application. Methods: 166Ho was produced by direct irradiation of 165Ho in the research reactor. 250 µg...

Blocking Methods Applied to Casualty Records from the Syrian Conflict

Estimation of death counts and associated standard errors is of great importance in armed conflict such as the ongoing violence in Syria, as well as historical conflicts in Guatemala, Perú, Colombia, Timor Leste, and Kosovo. For example, statistical estimates of death counts were cited as important evidence in the trial of General Efraín Ríos Montt for acts of genocide in Guatemala. Estimation ...

Strategies for internal, regional and international trilogy in the Syrian conflict

Background and Aim: The Arab revolutions began in late 2010, and in Syria these movements first took hold in Deraa. Because the Syrian crisis is a multifaceted, complex, and diverse political and security crisis with conflicting goals, this research divides actors into three levels (internal, regional, and international) in order to explain the subject accurately. Each domestic ...

Prevalence of depression in Syrian refugees and the influence of religiosity.

BACKGROUND: Many surveys have underlined the high levels of distress Syrian refugees have endured since the conflict arose in their country, yet few have used reliable diagnostic tools for the clinical assessment of resulting mental disorders. The aim of our study is to assess the onset of new depressive disorders following the Syrian war, and to investigate the correlation of religiosity ...

Indian Childhood Cirrhosis: Case Report and Pediatric Diagnostic Challenges

Introduction: Indian childhood cirrhosis is a chronic liver disease usually seen in the paediatric age group and is unique to the Indian subcontinent. The definitive causative factor for the disease has not yet been identified, but excess copper ingestion has been associated with it. Case presentation: A one-and-a-half-year-old, premorbidly normal male child of Indian origin presented with a history of gradual di...

Journal title:
  • CoRR

Volume: abs/1710.02690

Publication date: 2017